1 Introduction

Partial least squares discriminant analysis (PLS-DA) is a statistical method that combines the techniques of partial least squares (PLS) regression and linear discriminant analysis (LDA).

PLS-DA works by first finding the linear combinations of the predictor variables (X) that are most correlated with the response variable (Y). These linear combinations are called latent variables. The latent variables are then used to train an LDA model, which can be used to classify new samples.

PLS-DA has several advantages over other classification methods. First, it is relatively robust to collinearity, a common problem in data sets with many correlated variables. Second, PLS-DA can handle data sets with a large number of predictor variables, even more variables than samples. Third, PLS-DA can be used to visualise data, which can be helpful for understanding the relationships between the predictor variables and the response variable.

2 Load example dataset

library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
data(breast.TCGA)

# Extract training data and name each data frame
# There are 3 dataframes (3 omics of the same samples), stored as list
# In each dataframe, the rows represent samples.
# Columns represent feature measurements (genes, miRNA, protein)
X <- list(mRNA = breast.TCGA$data.train$mrna, 
          miRNA = breast.TCGA$data.train$mirna, 
          protein = breast.TCGA$data.train$protein)

Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal  Her2  LumA 
##    45    30    75

3 PLSDA

3.1 plsda()

We can use the plsda() function to find combinations of features that discriminate the groups.
The inputs are a matrix of values (sample-rows by feature-columns, X), together with a vector of the groups (Y).
We can also choose the number of components, and whether to apply scaling and a log-ratio transform (as seen in Session 1 with pca()).
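The "CLR" option applies a centred log-ratio transform to each sample before fitting. As a minimal base-R sketch (the clr() helper below is hypothetical, written for illustration rather than taken from mixOmics), each positive value is log-transformed and then centred by the sample's mean log value:

```r
# Hypothetical helper illustrating the centred log-ratio (CLR) transform:
# log-transform the (positive) values, then centre by the sample's mean log.
clr <- function(x) log(x) - mean(log(x))

counts <- c(2, 8, 32, 128)
round(clr(counts), 4)  # -2.0794 -0.6931  0.6931  2.0794
sum(clr(counts))       # CLR-transformed values sum to (approximately) zero
```

Centring on the log scale makes the values relative to the sample's own geometric mean, which is why CLR is commonly used for compositional data such as sequencing counts.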

plsda.mRNA <- plsda(X$mRNA, Y=Y, ncomp = 3, scale = TRUE, logratio = "CLR")

# plot the result
plotIndiv(plsda.mRNA, ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA", 
          legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)

plotIndiv() applied to plsda objects projects the samples into the plsda component subspace; components 1 and 2 by default. We can also look at other components, which might show further separation of groups in complex datasets.

plotIndiv(plsda.mRNA, comp = c(1,3), ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA", 
          legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)

  • From the look of these plots, Component 3 is not really adding any further value in separating the groups.
  • Component 1 seems useful for separating LumA < Her2 < Basal
  • Component 2 seems useful for separating (LumA or Basal) < Her2

PLSDA with a single component may be sufficient for some experiments (e.g. a binary group comparison of healthy vs disease samples), whereas experiments with many groups might be optimally separated using more components.

3.2 Loadings

Each component is a composite of all features, each weighted by a coefficient (its loading). Loadings can be displayed with plotLoadings(), and the top-weighted features extracted with selectVar().

plotLoadings(plsda.mRNA, comp = 1, contrib = "max", method = "median")

plotLoadings(plsda.mRNA, comp = 2, contrib = "max", method = "median")

loadings.comp1 <- selectVar(plsda.mRNA, comp=1)
loadings.comp2 <- selectVar(plsda.mRNA, comp=2)
head(loadings.comp1$value)
##          value.var
## ZNF552  -0.1405038
## SLC43A3  0.1373226
## PREX1   -0.1280330
## FMNL2    0.1274558
## CCNA2    0.1256307
## LMO4     0.1222428
head(loadings.comp2$value)
##           value.var
## CDK18    -0.2059362
## TP53INP2  0.2032781
## TRIM45   -0.1901160
## STAT5A   -0.1774087
## PLEKHA4  -0.1667928
## ZNF37B   -0.1649502

4 Sparse PLSDA tuning

Increasing the number of components improves discrimination up to a point, after which no further value is added.

Similarly, discrimination generally improves as the most highly weighted features are added to each component, after which adding more features has minimal or even negative impact (due to noise). It is therefore recommended to optimise both the number of components and the number of features per component, using a tuning procedure.

General advice for choosing parameters in the tuning procedure:

  • When the dataset has <10 samples, use the “loo” (leave-one-out) method of validation.
  • With >10 samples, use “Mfold” and set folds = N (where N is up to a maximum of ~10, while still allowing ~5 samples per group in each fold).
  • nrepeat: start with a low number for initial testing (<10) and increase it for the final tuning once all other settings are decided (>50). With complex data and many repeats this can become a very slow step (e.g. 30 mins on a laptop).
  • cpus: the number of processor cores to use in parallel (usually your total minus 1 is ideal).

The next two parameters determine how the error/success rate is defined, which will depend on the complexity of the data:

  • measure (the error metric to minimise): “BER” is the balanced error rate, a particularly good choice for experiments with unequal group sizes. Alternatives are “overall” or “AUC” (area under the curve).
  • dist: options are “max.dist”, “centroids.dist”, or “mahalanobis.dist”, in increasing order of complexity. Usually the simpler distance options are preferable.
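As a toy base-R illustration of how a rule like “centroids.dist” behaves, the sketch below (with invented component scores, not real model output) assigns a new point to the class whose centroid is nearest in the component subspace:

```r
# Toy example of the "centroids.dist" idea: classify a new point by the
# nearest class centroid in a 2-D component space (all values invented).
scores  <- rbind(c(-2.0, 0.0), c(-1.8, 0.2), c(2.0, 0.1), c(2.2, -0.1))
classes <- c("A", "A", "B", "B")

# centroid = per-class mean position in the component space
centroids <- rbind(A = colMeans(scores[classes == "A", ]),
                   B = colMeans(scores[classes == "B", ]))

new.point <- c(1.5, 0)
dists <- apply(centroids, 1, function(ctr) sqrt(sum((new.point - ctr)^2)))
names(which.min(dists))  # "B" - the nearest centroid wins
```

The mahalanobis distance extends this idea by also accounting for the spread and correlation of each class, which is why it is the most complex of the three options.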
test.keepX <- 2^(1:7)
test.keepX
## [1]   2   4   8  16  32  64 128
set.seed(123)

tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold", 
                                folds = 5, test.keepX = test.keepX,
                                dist = "centroids.dist",
                                scale = TRUE, logratio = "CLR", 
                                nrepeat = 30, cpus = 3, progressBar = FALSE, 
                                measure = "BER")
  
plot(tune.splsda.mRNA)

Inspect the performance (BER), which we want to minimise, as the number of components and the number of selected features increase.

A few observations:

  • In this example, component 1 alone (blue) seems optimised at 64 genes.
  • Adding a component: 1+2 (orange) seems optimised at 12 genes.
  • Adding a component: 1+2+3 (grey) does not clearly add any further benefit.
  • So an ideal model might use two components, with 64 and 12 genes, respectively.
  • Considering also that fewer genes might be desirable for a minimal gene signature, we could choose a lower number of genes with only a slight loss in performance (e.g. 32+12 genes).
  • We could also do a more detailed ‘scan’ of possible values (test.keepX).

5 Tuned sPLSDA

Let’s try a slightly simplified sparse PLSDA model, using two components with 32 and 16 genes, respectively.

tuned.splsda.mRNA <- splsda(X$mRNA, Y = Y, ncomp = 2, keepX = c(32, 16))
  
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 32 + 16 genes", 
          ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6, 
          legend.title = "Cancer\nSubtype", legend = TRUE)

# we can re-plot with a background-shading of the group boundary calls

background = background.predict(tuned.splsda.mRNA, comp.predicted=2, dist = "centroids.dist")
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 32 + 16 genes", 
          ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6, 
          legend.title = "Cancer\nSubtype", legend = TRUE, 
          background = background)

The separation between groups looks as good as, or better than, the initial PLSDA model, and requires fewer features.

6 Feature Loadings

We can inspect the top-ranked chosen genes and their loadings.

plotLoadings(tuned.splsda.mRNA, comp = 1, contrib = "max", method = "median", size.name = 0.5)

plotLoadings(tuned.splsda.mRNA, comp = 2, contrib = "max", method = "median")

loadings.comp1 <- selectVar(tuned.splsda.mRNA, comp=1)
loadings.comp2 <- selectVar(tuned.splsda.mRNA, comp=2)
head(loadings.comp1$value)
##         value.var
## ZNF552 -0.4169182
## KDM4B  -0.3691667
## PREX1  -0.3014427
## LRIG1  -0.2998513
## CCNA2   0.2802598
## TTC39A -0.2671062
head(loadings.comp2$value)
##           value.var
## TP53INP2  0.5919659
## CDK18    -0.5623317
## NDRG2    -0.2928047
## TRIM45   -0.2842161
## STAT5A   -0.2480526
## PLEKHA4  -0.2210436

7 Feature Stability

Stability of the selected features can be reviewed with the perf() function, which repeatedly refits the model on cross-validation folds and reports the classification error rates for each component and distance measure (MSEP, R² and Q² are the analogous metrics perf() reports for regression-mode PLS models).

These results give an indication of how frequently each feature is selected when the procedure is repeated across folds.
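For example (a sketch based on the mixOmics documentation; the validation settings here are illustrative, not tuned):

```r
# Assess the tuned model with repeated cross-validation (illustrative settings)
perf.splsda.mRNA <- perf(tuned.splsda.mRNA, validation = "Mfold", folds = 5,
                         nrepeat = 10, progressBar = FALSE)

perf.splsda.mRNA$error.rate       # classification error per component/distance
perf.splsda.mRNA$features$stable  # how often each feature was selected
```

Features that are selected in most repeats are good candidates for a robust gene signature; features selected only occasionally may reflect noise.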

8 Heatmap

cim() is a plotting function in mixOmics, which can generate a basic heat map of expression (or other values, depending on context). cim is an abbreviation of Clustered Image Map.

Note that saving the output (large image) as a file is recommended, rather than viewing in RStudio directly. Alternatively, you can open a new window for graphing with X11().

cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA")

We can try making the heatmap more informative by:

  • adding a sidebar colour-coding the samples by experimental group
  • plotting the genes from components 1 and 2 separately
  • customising the legend
  • adjusting the size of labels/margins so that they fit (or not trying to display things that won’t fit)
legend=list(legend = levels(Y), # set of classes
            col = unique(color.mixo(Y)), # set of colours
            title = "Cancer Subtype", # legend title
            cex = 0.7) # legend size

cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp1", comp = 1, 
    row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE, ylab = "Samples")
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp2", comp = 2,
    row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE, ylab = "Samples")
sPLSDA Comp.1 heatmap
sPLSDA Comp.2 heatmap

9 Predictions

sPLS-DA models can be useful for finding a good combination of distinctive features associated with the experimental groupings. These models can also be used to make predictions, i.e. classifying new samples.

# Prepare test set data
data.test.mRNA <- breast.TCGA$data.test$mrna
Y.test <- breast.TCGA$data.test$subtype
table(Y.test)
## Y.test
## Basal  Her2  LumA 
##    21    14    35
predict.mRNA <- predict(tuned.splsda.mRNA, newdata = data.test.mRNA, 
                               dist = "centroids.dist")

predict.mRNA # List the different outputs
## 
## Call:
##  predict.mixo_spls(object = tuned.splsda.mRNA, newdata = data.test.mRNA, dist = "centroids.dist") 
## 
##  Main numerical outputs: 
##  -------------------- 
##  Prediction values of the test samples for each component: see object$predict 
##  variates of the test samples: see object$variates 
##  Classification of the test samples for each distance and for each component: see object$class
  • The output of predict() includes prediction scores for each sample x group.
  • We can check the accuracy of the classifier on a confusion matrix.
head(predict.mRNA$predict)
## , , dim1
## 
##          Basal      Her2        LumA
## A54N 0.8146634 0.2570645 -0.07172794
## A2NL 0.9710992 0.2744097 -0.24550891
## A6VY 0.7322610 0.2479279  0.01981105
## A3XT 0.8082378 0.2563520 -0.06458983
## A1F0 0.6502978 0.2388401  0.11086215
## A26I 0.7176443 0.2463073  0.03604847
## 
## , , dim2
## 
##          Basal        Her2        LumA
## A54N 1.0458605 -0.16729274  0.12143224
## A2NL 1.0783905  0.07747889 -0.15586937
## A6VY 0.8089302  0.10720334  0.08386649
## A3XT 1.0456035 -0.17932769  0.13372416
## A1F0 0.8087518 -0.05199884  0.24324707
## A26I 0.6536870  0.36369948 -0.01738646
confusion.mat.mRNA <- get.confusion_matrix(truth = Y.test, 
                predicted = predict.mRNA$MajorityVote$centroids.dist[,2])
confusion.mat.mRNA
##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 20                 1                 0
## Her2                   0                14                 0
## LumA                   0                 1                34
get.BER(confusion.mat.mRNA)
## [1] 0.02539683
  • The Balanced Error Rate (BER) suggests the classification accuracy is good (1 − BER > 95%)
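The balanced error rate is simply the mean of the per-class error rates. As a quick base-R check against the confusion matrix above:

```r
# Recompute the BER by hand from the confusion matrix above
cm <- matrix(c(20,  1,  0,
                0, 14,  0,
                0,  1, 34),
             nrow = 3, byrow = TRUE,
             dimnames = list(truth     = c("Basal", "Her2", "LumA"),
                             predicted = c("Basal", "Her2", "LumA")))

per.class.error <- 1 - diag(cm) / rowSums(cm)  # error rate within each true class
mean(per.class.error)                          # matches get.BER(): 0.02539683
```

Because each class contributes equally to the mean, the BER is not dominated by the largest group (here LumA), unlike the overall error rate.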

10 Practice

  • Perform supervised clustering (sPLSDA) based on the miRNA data layer, to optimally separate the three subtypes, or
  • optimally separate the Basal subtype from the other two subtypes.
  • There is miRNA data for the independent test samples (breast.TCGA$data.test$mirna); how accurately can your model classify them?

11 Discussion and Recap

  • Unsupervised vs Supervised clustering
  • PLSDA
  • sparse PLSDA
  • Stability
  • Predictions